AV@CAR: A Spanish Multichannel Multimodal Corpus for In-Vehicle Automatic Audio-Visual Speech Recognition
Abstract
This paper describes the acquisition of the multichannel multimodal database AV@CAR for automatic audio-visual speech recognition in cars. Automatic speech recognition (ASR) plays an important role inside vehicles in keeping the driver free from distraction. It is also known that visual information (lip-reading) can improve ASR accuracy under adverse conditions such as those inside a car. The corpus described here is intended to provide training and testing material for several classes of audio-visual speech recognizers, including isolated-word systems, word-spotting systems, vocabulary-independent systems, and speaker-dependent or speaker-independent systems, for a wide range of applications. The audio database comprises seven audio channels, including clean speech (captured with a close-talk microphone), noisy speech from several microphones placed on the cabin overhead, a noise-only signal coming from the engine compartment, and information about the speed of the car. For the video database, a small video camera sensitive to the visible and near-infrared bands is placed on the windscreen and used to capture the driver's face. This is done under different lighting conditions, both during the day and at night. Additionally, the same individuals are recorded in the laboratory under controlled conditions to obtain noise-free speech signals, 2D images, and 3D-plus-texture face models.
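As a rough illustration of how such a multichannel recording session might be organised, the following Python sketch models one in-car session. The channel counts, field names, and file paths are hypothetical assumptions for illustration only; they do not reflect the actual AV@CAR directory layout.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of one AV@CAR-style in-car session.
# Channel layout, microphone count, and file names are assumptions,
# not the corpus's actual structure.

@dataclass
class AudioChannels:
    close_talk: str        # clean speech from the close-talk microphone
    overhead: List[str]    # noisy speech from microphones on the cabin overhead (count assumed)
    engine_noise: str      # noise-only signal from the engine compartment
    speed_log: str         # recorded car-speed information, distributed as one channel

@dataclass
class CarSession:
    speaker_id: str
    lighting: str          # e.g. "day" or "night"
    audio: AudioChannels
    video: str             # driver-face video from the windscreen-mounted camera

session = CarSession(
    speaker_id="spk001",
    lighting="night",
    audio=AudioChannels(
        close_talk="spk001/ch0_close_talk.wav",
        overhead=[f"spk001/ch{i}_overhead.wav" for i in range(1, 5)],
        engine_noise="spk001/ch5_engine.wav",
        speed_log="spk001/ch6_speed.csv",
    ),
    video="spk001/face_ir.avi",
)
```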
Similar resources
A Stereo Vision Lip Tracking Algorithm and Subsequent Statistical Analyses of the Audio-Video Correlation in Australian English
Human perception of the world is inherently multi-sensory because the information provided is multimodal. The perception of spoken language is no exception. Besides the auditory information, there is also visual speech information, provided by the facial movements that result from moving the articulators during speech production. Visual speech information contributes to speech perception in all...
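To make the notion of audio-video correlation concrete, the sketch below (not the paper's actual algorithm) computes a Pearson correlation between a short-time log-energy contour of the audio and a hypothetical per-frame lip-opening track. The frame length and hop size assume 16 kHz audio and are illustrative assumptions.

```python
import numpy as np

def frame_energy(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Short-time log energy of a mono audio signal (25 ms frames, 10 ms hop at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def audio_video_correlation(audio: np.ndarray, lip_opening: np.ndarray) -> float:
    """Pearson correlation between frame log energy and a per-frame lip-opening measure."""
    energy = frame_energy(audio)
    n = min(len(energy), len(lip_opening))   # align the two tracks in length
    return float(np.corrcoef(energy[:n], lip_opening[:n])[0, 1])
```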
Czech audio-visual speech corpus of a car driver for in-vehicle audio-visual speech recognition
This paper presents the design of an audio-visual speech corpus for in-vehicle audio-visual speech recognition. Throughout the world, there exist several audio-visual speech corpora. There are also several (audio-only) speech corpora for in-vehicle recognition. So far, we have not found an audio-visual speech corpus for in-vehicle speech recognition. And we have not found any audio-visual speec...
CENSREC-AV: evaluation frameworks for audio-visual speech recognition
This paper introduces evaluation frameworks for bimodal speech recognition in noisy conditions and real environments. In order to develop robust speech recognition in noisy environments, bimodal speech recognition, which uses acoustic and visual information, has attracted particular attention over the past decade. As a lot of methods and techniques for bimodal speech recognition have b...
NTCD-TIMIT: A New Database and Baseline for Noise-Robust Audio-Visual Speech Recognition
Although audio-visual speech is well known to improve the robustness properties of automatic speech recognition (ASR) systems against noise, the realm of audio-visual ASR (AV-ASR) has not gathered the research momentum it deserves. This is mainly due to the lack of audio-visual corpora and the need to combine two fields of knowledge: ASR and computer vision. This paper describes the NTCD-TIMIT ...
Audio-Video Automatic Speech Recognition: An Example of Improved Performance through Multimodal Sensor Input
One of the advantages of multimodal HCI technology is the performance improvement that can be gained over conventional single-modality technology by employing complementary sensors in different modalities. Such information is particularly useful in practical, real-world applications where the application's performance must be robust against all kinds of noise. An example is the domain of automati...